Anyone familiar with ICE data knows that the agency is notorious for releasing dirty data filled with pesky errors. For example, consider the Krome Service Processing Center, the very first immigration processing center, established in the 1980s after the closure of Ellis Island in 1954. As the first detention center of the contemporary system, Krome is the one facility you would expect ICE to name consistently. However, the facility inspections page lists the following names for the Krome Service Processing Center:
Every instance is different. In one case “Special” is substituted for “Service.” In another the word “Processing” is misspelled “Procesing,” while in a third the title is abbreviated. This is one example of many. Elsewhere, a facility is named “Detention Center” in one entry and “Detention Facility” in another. Assembling the data into a useful table that can be sorted, filtered, and summarized requires standardizing the names.
From early on, Craig decided to assemble the data as reported, errors and all, and to perform standardization programmatically, which makes the cleaning reproducible. The project’s primary cleaning steps are wrapped in a custom function, clean_facility_names(), stored in a separate R file. This allows more than 100 lines of code to be collapsed into a single function call. For display purposes, however, the lines of code that make up the transformations are detailed on this page.
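As a rough illustration of how cleaning rules can be collapsed into one call, here is a minimal sketch. It is not the project's actual function: the one-argument signature is an assumption, only two of the rules shown later on this page are included, and the toy input is invented.

```r
library(dplyr)
library(stringr)

# Hypothetical, shortened stand-in for clean_facility_names().
# Assumes the function takes a data frame with a `facility` column
# and returns the same data frame with standardized names.
clean_facility_names <- function(df) {
  df %>%
    mutate(
      facility = str_trim(facility, side = "both"),
      facility = str_replace_all(facility,
        pattern = "^Florence SPC",
        replacement = "Florence Service Processing Center"),
      facility = str_replace_all(facility,
        pattern = "^Houston CDF",
        replacement = "Houston Contract Detention Facility")
    )
}

# Invented toy input standing in for the assembled inspections data
toy <- tibble(facility = c(" Florence SPC",
                           "Houston CDF ",
                           "Prairieland Detention Center"))

cleaned <- clean_facility_names(toy)
cleaned$facility
```

Kept in its own file, a function like this could be loaded with source() at the top of any analysis script, so every analysis applies the same rules.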
# Load necessary libraries
library(googlesheets4)
library(readr)
library(tidyverse)
library(janitor)
library(lubridate)
While the students were assembling data, we worked out of a Google Sheet because it gave each of us simultaneous access to the data. For preliminary analysis, a data frame is constructed by reading from the sheet.
# Read in the sheet
df_inspect <- read_sheet("https://docs.google.com/spreadsheets/d/1im5VSi3bIEi13O8WQ56wEIXSyNEstbGMylXXgD9bAG0/edit#gid=1858227071",
                         sheet = "Inspections",
                         col_names = TRUE,
                         col_types = "c") %>%
  clean_names()
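The read above imports every column as character (col_types = "c") and then normalizes the column headers. The sketch below reproduces the same idea without access to the Google Sheet, using readr on inline text. The column headers, the rows, and the "Name (ST) - date" layout are all invented for illustration. The layout is only inferred from the separators used in the cleaning code further down.

```r
library(readr)
library(janitor)

# Invented two-row stand-in for the "Inspections" sheet
raw_csv <- 'Facility,Report Year
"Florence SPC (AZ) - Feb. 24, 2021",2021
"Houston CDF (TX) - Jan. 31, 2019",2019
'

# Read every column as character so messy values arrive exactly as
# reported, then convert the headers to snake_case with clean_names()
df_toy <- clean_names(
  read_csv(I(raw_csv), col_types = cols(.default = col_character()))
)

names(df_toy)
```

Reading everything as text postpones type conversion until after cleaning, so an irregular value such as "Jan. 29 - 31, 2019" is preserved for repair rather than turned into a missing value on import.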
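Sometimes facility names are repeated. The dplyr function distinct() is useful for isolating issues in the facility field [@R-dplyr]: keeping one row per unique spelling and sorting alphabetically places near-duplicate names next to each other.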
df_inspect %>%
  distinct(facility, .keep_all = TRUE) %>%
  arrange(facility)
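To see what this check surfaces, here is a small sketch on invented rows. It uses the three Krome spellings that the cleaning code below handles. The inspection_id column is hypothetical, and count() is shown as a complementary check rather than something the project necessarily uses.

```r
library(dplyr)

# Invented rows; the Krome spellings are the ones handled later on this page
toy <- tibble(
  facility = c("Krome SPC",
               "Krome Special Processing Center",
               "Krome Service Procesing Center",
               "Krome SPC",
               "Florence SPC"),
  inspection_id = 1:5  # hypothetical identifier column
)

# One row per unique spelling, sorted so variants sit together
variants <- toy %>%
  distinct(facility, .keep_all = TRUE) %>%
  arrange(facility)

# How often each spelling occurs, most frequent first
name_counts <- toy %>%
  count(facility, sort = TRUE)

variants
name_counts
```

Four distinct spellings remain from five rows, and three of them refer to the same facility, which is the kind of inconsistency the cleaning rules are meant to remove.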
The following procedures detail the main facility name cleaning operations. This block of code was converted to a custom function and written to an R file, so the entire block can be called with a single function or incorporated into an analysis-specific pipeline. The cleaning procedures are provided here to show the transformations, but the separate function file can be downloaded directly. Heavy use is made of the dplyr [@R-dplyr] function mutate in combination with the stringr [@R-stringr] functions str_replace_all and str_trim, the tidyr function separate, and the base R function replace. Apart from replace, all of these are part of the tidyverse [@R-tidyverse] family of libraries.
df_inspect <- df_inspect %>%
  # Replace the curious whitespace character that appears in more
  # recent facility names with a regular space. The pattern \p{Zs}
  # matches any Unicode space separator; a non-breaking space is the
  # assumed culprit.
  mutate(facility = str_replace_all(facility,
    pattern = "\\p{Zs}",
    replacement = " ")) %>%
  # Split off the inspection date that follows ") - "
  separate(col = facility,
    into = c("facility", "inspection_date"),
    sep = "\\) - ") %>%
  # Split the remainder into facility name and state at "("
  separate(col = facility,
    into = c("facility", "state"),
    sep = "\\(") %>%
  # Deal with issues in the facility names, state, and inspection date
  mutate(facility = str_trim(facility, side = "both"),
    state = str_trim(state, side = "both"),
    facility = str_replace_all(facility,
      pattern = "^Adelanto ICE Processing Center-East",
      replacement = "Adelanto ICE Processing Center - East"),
    facility = str_replace_all(facility,
      pattern = "^Adelanto ICE Processing Center-West",
      replacement = "Adelanto ICE Processing Center - West"),
    facility = str_replace_all(facility,
      pattern = "^Allen Parish Detention Facility",
      replacement = "Allen Parish Public Safety Complex"),
    facility = str_replace_all(facility,
      pattern = "^Berks County Residential Center",
      replacement = "Berks Family Residential Center"),
    inspection_date = replace(inspection_date, inspection_date == "Jan. 29 - 31, 2019", "Jan. 31, 2019"),
    facility = str_replace_all(facility,
      pattern = "^Bristol County Jail$",
      replacement = "Bristol County Jail and House of Correction"),
    facility = str_replace_all(facility,
      pattern = "^Buffalo$",
      replacement = "Buffalo Batavia Service Processing Center"),
    state = replace(state, state == "Batavia) Service Processing Center", "NY"),
    facility = str_replace_all(facility,
      pattern = "^Calhoun County Jail",
      replacement = "Calhoun County Correctional Center"),
    facility = str_replace_all(facility,
      pattern = "^Clay County Justice Center",
      replacement = "Clay County Jail"),
    facility = str_replace_all(facility,
      pattern = "^Coastal Bend Detention Facility",
      replacement = "Coastal Bend Detention Center"),
    state = replace(state, state == "David L. Moss Criminal Justice Center)", "OK"),
    facility = str_replace_all(facility,
      pattern = "^Dodge County Detention Center",
      replacement = "Dodge County Detention Facility"),
    facility = str_replace_all(facility,
      pattern = "^Donald W. Wyatt Detention Center",
      replacement = "Donald W. Wyatt Detention Facility"),
    facility = str_replace_all(facility,
      pattern = "^Essex County Corrections Facility",
      replacement = "Essex County Correctional Facility"),
    facility = str_replace_all(facility,
      pattern = "^Farmville Detention Center$",
      replacement = "Immigration Centers of America - Farmville"),
    facility = str_replace_all(facility,
      pattern = "^Florence SPC",
      replacement = "Florence Service Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Houston CDF",
      replacement = "Houston Contract Detention Facility"),
    state = replace(state, state == "Polk)", "TX"),
    facility = str_replace_all(facility,
      pattern = "^Immigration Centers of America$",
      replacement = "Immigration Centers of America - Farmville"),
    state = replace(state, state == "ICA", "VA"),
    inspection_date = replace(inspection_date,
      inspection_date == "Farmville Detention Center (FDC) (VA",
      "Feb. 24, 2021"),
    # This one picks a value in the facility col and changes a value in the state col
    state = replace(state, facility == "Immigration Centers of America - Farmville",
      "VA"),
    facility = str_replace_all(facility,
      pattern = "^Joe Corley Detention Facility",
      replacement = "Joe Corley Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Karnes County Residential Center",
      replacement = "Karnes County Family Residential Center"),
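    # Standardize every Krome variant, including the misspelled
    # "Procesing", to the correctly spelled name
    facility = str_replace_all(facility,
      pattern = "^Krome SPC",
      replacement = "Krome Service Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Krome Special Processing Center",
      replacement = "Krome Service Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Krome Service Procesing Center",
      replacement = "Krome Service Processing Center"),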
    state = replace(state, state == "SPC)", "FL"),
    facility = str_replace_all(facility,
      pattern = "^Mesa Verde Detention Facility",
      replacement = "Mesa Verde ICE Processing Facility"),
    facility = str_replace_all(facility,
      pattern = "^Northwest Contract Detention Center",
      replacement = "Northwest ICE Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Northwest Detention Center",
      replacement = "Northwest ICE Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Okmulgee County Jail-Moore Detention Facility",
      replacement = "Okmulgee County Jail - Moore Detention Facility"),
    facility = str_replace_all(facility,
      pattern = "^Orange County Jail",
      replacement = "Orange County Correctional Facility"),
    facility = str_replace_all(facility,
      pattern = "^Otay Mesa Detention Facility",
      replacement = "Otay Mesa Detention Center"),
    facility = str_replace_all(facility,
      pattern = "^Prarieland Detention Center",
      replacement = "Prairieland Detention Center"),
    facility = str_replace_all(facility,
      pattern = "^Richwood Correcrtional Center",
      replacement = "Richwood Correctional Center"),
    facility = str_replace_all(facility,
      pattern = "^Rio Grande Processing Center",
      replacement = "Rio Grande Detention Center"),
    facility = str_replace_all(facility,
      pattern = "^Robert A. Deyton Correctional Center",
      replacement = "Robert A. Deyton Detention Facility"),
    facility = str_replace_all(facility,
      pattern = "^Robert A. Deyton Detention Center",
      replacement = "Robert A. Deyton Detention Facility"),
    facility = str_replace_all(facility,
      pattern = "^South Texas Detention Complex",
      replacement = "South Texas ICE Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^South Texas Processing Center",
      replacement = "South Texas ICE Processing Center"),
    facility = str_replace_all(facility,
      pattern = "^Strafford County Corrections",
      replacement = "Strafford County Department of Corrections"),
    inspection_date = replace(inspection_date, state == "CO)- Mar. 31, 2021", "Mar. 31, 2021"),
    state = replace(state, state == "CO)- Mar. 31, 2021", "CO"),
    facility = str_replace_all(facility,
      pattern = "^T. Don Hutto Detention Center$",
      replacement = "T. Don Hutto Residential Center"),
    facility = str_replace_all(facility,
      pattern = "Tulsa County Jail - David L. Moss Criminal Jutice Center",
      replacement = "David L. Moss Criminal Justice Center"),
    facility = str_replace_all(facility,
      pattern = "^Tulsa County Jail$",
      replacement = "David L. Moss Criminal Justice Center"),
    state = replace(state, state == "David L. Moss Justice Center)", "OK"),
    facility = str_replace_all(facility,
      pattern = "^Washoe County Detention Centerr",
      replacement = "Washoe County Detention Center"),
    facility = str_replace_all(facility,
      pattern = "^Webb County Detention Facility",
      replacement = "Webb County Detention Center")
  )
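The two separate() calls at the top of the pipeline do the structural work, and they also explain why rules such as state == "SPC)" are needed. The sketch below runs the same splits on three invented raw strings. The "Name (ST) - date" layout is inferred from the separators in the code above, and the third row, with an extra parenthetical, is a guess at the kind of entry that produces a stray state value.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Invented raw entries in the assumed "Name (ST) - date" layout
toy <- tibble(facility = c(
  "Florence SPC (AZ) - Feb. 24, 2021",
  "Houston CDF (TX) - Jan. 31, 2019",
  "Krome Service Processing Center (SPC) (FL) - Jan. 31, 2019"
))

split_toy <- toy %>%
  # Everything after ") - " becomes the inspection date
  separate(col = facility,
    into = c("facility", "inspection_date"),
    sep = "\\) - ") %>%
  # The text after the first "(" becomes the state. extra = "drop"
  # discards surplus pieces as the default does, minus the warning.
  separate(col = facility,
    into = c("facility", "state"),
    sep = "\\(",
    extra = "drop") %>%
  mutate(facility = str_trim(facility, side = "both"),
    state = str_trim(state, side = "both"))

split_toy
```

In the third row the state column ends up holding "SPC)" rather than "FL", because the name contains a second opening parenthesis. A targeted replace() on the state column, like the ones in the pipeline above, repairs such rows after the split.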